1 Preparation

In the load_data chunk we have loaded all the data we will need for the analysis. We have 3 data.frames:

1.1 Matching RNA-seq samples to conditions

In the df.counts.raw data.frame, samples are identified by the sequence_id column:

sequence_id
S01
S02
S03
S04
S05
S06

In order to map sequence_id to sample_id we to combine the information in df.submission and df.design:

Submission file.
isolation_date sample_id sequence_id volume_ul conc_ng_ul A260 ratio_260_230 ratio_260_280 yield_ug
2020-01-19 191226_PC_2_1 S01 59 63.202 1.5800 1.985 2.061 3.728918
2020-01-19 191226_PC_2_3 S02 59 118.006 2.9502 1.454 2.224 6.962354
2020-01-19 191226_PC_2_4 S03 59 29.625 0.7400 1.305 2.095 1.747875
2020-01-19 200116_PC_3_1 S04 59 94.049 2.3512 2.197 2.233 5.548891
2020-01-19 200116_PC_3_3 S05 59 169.249 4.2312 2.255 2.153 9.985691
2020-01-19 200116_PC_3_4 S06 59 73.703 1.8426 2.185 2.108 4.348477
Design file.
multicultivator channel strain sample_time sample_id phosphate_mM purpose dataset
0 1 WT NA 191104_PC_1_1 0.2300 control pre-coure
0 1 WT NA 191226_PC_2_1 0.0115 treatment pre-coure
0 1 WT NA 200116_PC_3_1 0.2300 control pre-coure
0 2 WT NA 191104_PC_1_2 0.2300 control pre-coure
0 3 WT NA 191104_PC_1_3 0.2300 control pre-coure
0 3 WT NA 191226_PC_2_3 0.0115 treatment pre-coure

Create a data.frame called df.transcriptomics.design combining the required information:

sample_id sequence_id multicultivator channel strain sample_time phosphate_mM purpose dataset
191226_PC_2_1 S01 0 1 WT NA 0.0115 treatment pre-coure
191226_PC_2_3 S02 0 3 WT NA 0.0115 treatment pre-coure
191226_PC_2_4 S03 0 4 WT NA 0.0115 treatment pre-coure
200116_PC_3_1 S04 0 1 WT NA 0.2300 control pre-coure
200116_PC_3_3 S05 0 3 WT NA 0.2300 control pre-coure
200116_PC_3_4 S06 0 4 WT NA 0.2300 control pre-coure

2 Inspecting Feature Assignment

As mentioned above, the df.counts.raw data.frame contains the counts for each gene an sample. In addition, it contains some rows indicating the number of reads that were not aligned, ambiguous and for which no feature was found. This information is stored in the last rows of each counts file.

gene counts sequence_id
__no_feature 666901 S01
__ambiguous 192804 S01
__too_low_aQual 0 S01
__not_aligned 256021 S01
__alignment_not_unique 0 S01
__no_feature 1167785 S02
__ambiguous 382194 S02
__too_low_aQual 0 S02
__not_aligned 352425 S02
__alignment_not_unique 0 S02
__no_feature 819476 S03
__ambiguous 270242 S03
__too_low_aQual 0 S03
__not_aligned 212572 S03
__alignment_not_unique 0 S03
__no_feature 915897 S04
__ambiguous 294485 S04
__too_low_aQual 0 S04
__not_aligned 287451 S04
__alignment_not_unique 0 S04
__no_feature 829787 S05
__ambiguous 286712 S05
__too_low_aQual 0 S05
__not_aligned 372251 S05
__alignment_not_unique 0 S05
__no_feature 769118 S06
__ambiguous 255793 S06
__too_low_aQual 0 S06
__not_aligned 285180 S06
__alignment_not_unique 0 S06

2.1 Exercise:

We want to have information about the alignment process. For each sample, make a plot showing the percentage of reads that were not aligned, ambiguous and for which no feature was found.

2.2 Exercise:

For further analyses, we will need a data.frame that do NOT contain these rows. Create such data.frame and name it df.counts.

3 Inspecting Raw Count Data

The goal of this exercise is to perform a quality control of the count data. In order to do this you will visualize the data using different types of plots.

3.1 Total number of counts

The total number of counts for each sample might be different due to multiple factors.

3.1.1 Exercise:

Create a barplot showing the total number of counts for each sample.

How many reads do you (approximately) have from each sample? Are the samples different? Do you see outliers? Can you conclude that the values are consistent, or is a normalization step required?

3.2 Ribosomal RNA

Out of all the RNA present in a prokaryotic cell, roughly 85% is ribosomal (r)RNAs. Thus when sequencing RNA 85% of the reads will be on non-coding rRNA, while we are interested in mRNA. To improve sensitivity and dynamic range on the mRNA levels we remove the rRNA, called ribosomal RNA depletion. We should check if this procedure was effective.

3.2.1 Exercise:

Create a barplot showing the percentage of counts for each rRNA to the total number of counts.

What percentage of the reads is mapped to rRNA? Has each sample roughly the same amount of rRNA reads? What would be the effect of deviations in rRNA level when we compare the samples? Should we correct the data somehow for this?

3.3 Filtering out low-count genes and rRNA

Researchers often filter out the genes that are measured with a very little amount of reads. These measurements are then considered unreliable.

3.3.1 Exercise:

Create a new data.frame called df.counts.f in which you filter out all rRNA genes and all the genes that are measured with less than 24 reads in all samples. In addition, create a new column named counts_log2 that contains the counts in log space. We will need this column in the coming steps.

Hint: to avoid getting infinite values for genes with 0 counts, add 1 to all counts before you calculate the log. This is usually referred as pseudo-counts.

3.3.2 Exercise:

Did the filtering step have an impact on your data? Re-create the figure made in Excercise 2.1.1

3.4 Visualizing counts’ distribution:

Another way to check if data needs to be normalized is to visualize the distribution of the counts. This is usually done in log space.

3.4.1 Excercise:

Make a boxplot to visualize the distribution in gene expression level, for every sample.

Are the samples similar or different? Do you see outliers? Can you conclude that the values are consistent, or is a normalization step required?

3.4.2 Excercise:

Make a density plot to visualize the distribution in gene expression level, for every sample.

Are the samples similar or different? Do you see outliers? Can you conclude that the values are consistent, or is a normalization step required?

3.5 Sample comparison

An important aspect of quality control is comparing the samples. A direct way of comparing the samples is to plot the gene expression levels of one sample against the gene expression levels of another sample.

3.5.1 Excercise:

In order to be able to plot the log2 count values of two samples against each other, we need to transform the df.counts.f to wide format, such that we have one row per gene and one sample per column. Create a new data.frame, df.counts.f.wide, that has this format:

3.5.2 Excercise:

Now you can plot the log2 count values of two sample against each other. Which sample is most similar to sample 1? In what way do the other samples differ from sample 1?

3.5.3 Excercise:

It is somewhat challenging to compare all 6 samples with each other. You can use the ggpairs function from the GGally package to do this automatically.

What is your overall impression? Are all samples very similar or do you see differences? Can you give examples of large differences?

3.6 MA plots

Although this plot is very informative, it is often more informative and common to compare the gene expression values between samples (or conditions) using an MA plot. In an MA plot you plot the difference in gene expression for each gene on the y-axis. The total expression level is plotted on the x-axis. It was originally developed for microarray data (http://en.wikipedia.org/wiki/MA_plot), but is also widely used for RNA-seq data. It will become clear when you start making such plots:

3.6.1 Exercise:

In this MA plot you will compare sample 1 with sample 3:

First calculate the difference in log expression and then plot the difference against total expression:

The advantage is that differences from zero in the y-axis directly indicate a difference in log2 fold change. Each dot indicates a gene, and dots on the left side in the graph have a low expression level, and genes on the right side in the graph have a high expression level.

Note that log2 expression values are plotted. Are the samples similar? What can you conclude from this plot?

3.6.2 Excercise:

To analyze all 6 samples, it is handy to compare them all with a common reference. This common reference is the based median expression level of each gene across all samples. Use the df.counts.f data.frame to first calculate the log2 of the median counts per gene and then calculate M and A per sample.

3.6.3 Excercise:

Make a plot comparing every sample to the common reference.

Note that log2 expression values are plotted. Do you see any outliers?

3.7 PCA

An important plot for the Quality Control is the Principal Component Analysis plot. This allows you to infer the effect of the different experimental variables. In R, a PCA can be performed using prcomp() function. We already prepared helper functions for performing PCA’s and generating associated plots, you can find them in /Users/joeri/SurfDrive/MMP-UVA/Teaching/1920_SBiP/experiments/20200129_sbip/functions/pca.R.

(Adapted from https://tbradley1013.github.io/2018/02/01/pca-in-a-tidy-verse-framework )

We can use the f_pca function to perform the PCA on our data and also extract the most important information. It returns you a nested data.frame.

## Observations: 1
## Variables: 4
## $ data         <list> [<tbl_df[6 x 3328]>]
## $ pca          <list> [<2.802365e+01, 1.417579e+01, 4.776911e+00, 4.476511e+0…
## $ pca_aug      <list> [<tbl_df[6 x 3335]>]
## $ pca_loadings <list> [<tbl_df[19962 x 3]>]

First look at the amount of variance explained by each component using the f_pca_var_exp function.

## # A tibble: 6 x 4
##      pc variance  var_exp cum_var_exp
##   <dbl>    <dbl>    <dbl>       <dbl>
## 1     1 7.85e+ 2 7.52e- 1       0.752
## 2     2 2.01e+ 2 1.92e- 1       0.944
## 3     3 2.28e+ 1 2.18e- 2       0.966
## 4     4 2.00e+ 1 1.92e- 2       0.985
## 5     5 1.57e+ 1 1.50e- 2       1    
## 6     6 2.62e-29 2.50e-32       1

Or in plotted form using f_pca_plot_var_exp

Which components are most relevant for explaining the variance in our dataset?

Let’s inspect where our samples are placed on the first three components.

4 Normalization

The goal of this exercise is to perform a quality control of the gene expression count data, after normalization. In order to do this you will visualize the data using different types of plots. You will analyze your data using a methodology that is called DESeq, which is described in:

  1. https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-10-r106

And in

  1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4302049/

4.1 Size factors

Normalizing count data is one of the steps in the DESeq workflow. It is done to correct for differences in sequencing depth among samples and it is based on the median-of-ratios method. You can find the mathematical formulation in link # 1 provided above. In short, size factors are calculated per sample and give a measure of the median distance all genes of one sample to the geometric mean of each gene across all samples.

4.1.1 Exercise:

To calculate the size factors we will follow the next steps (adapted from this tutorial):

  1. Create a pseudo-reference sample: for each gene, calculate the geometric mean of counts. Note: you can also use the log-average to calculate this but then you will have to modify the coming steps accordingly.

  2. Calculate ratio of each sample to the pseudo-reference: for each sample, calculate the ratio of each gene to the the pseudo-reference. Note: if the pseudo-reference is equal to 0, make this ratio NA.

  3. Calculate the size factor for each sample: this is done by calculating the median of all ratios per sample. Hint: set the argument na.rm = T in the median function.

  4. Normalize the count data: for each sample, divide the counts of each gene by the calculated size factor.

Print the calculate size factors for each sample in a table:

sequence_id size_factor
S01 0.8449753
S02 1.5748351
S03 1.1641719
S04 0.9563367
S05 0.8856490
S06 0.7758809

4.2 Visualizing the effect of normalization

Remake the plots you created in sections 2.1 and 2.4 using the normalized data

4.2.1 Exercise:

Plot the total number of counts per sample:

How many reads do you (approximately) have from each sample? Are the samples different? Do you see outliers? Has the normalization worked?

4.2.2 Exercise:

Make a boxplot to visualize the distribution of normalized counts, for every sample:

Are the samples similar or different? Do you see outliers? Can you conclude that the normalization worked?

4.2.3 Exercise:

You can also make density plots to visualize the distribution of normalized counts, for every sample:

Are the samples similar or different? Do you see outliers? Can you conclude that the normalization worked?

4.3 PCA

Repeat the PCA done section 2.7 using the normalized data.

4.3.1 Exercise:

Make a plot showing the explained variance for each component

4.3.2 Exercise:

Make a plot showing where out samples are place on the fist 3 components

Does this look different from the PCA with pre-normalized data? In what way?

4.3.3 Exercise:

We will now investigate which genes are contributing most to the difference between the samples along the first PC axis (PC1).

Most down-regulated genes of PC 1:

## # A tibble: 10 x 2
##    gene    loading
##    <chr>     <dbl>
##  1 sll1552  -0.227
##  2 sll0654  -0.201
##  3 sll0720  -0.180
##  4 nucH     -0.165
##  5 pstS     -0.152
##  6 pstC     -0.132
##  7 sphX     -0.130
##  8 pstA     -0.120
##  9 sll0594  -0.107
## 10 pstB     -0.105

Most up-regulated genes of PC 1:

## # A tibble: 10 x 2
##    gene      loading
##    <chr>       <dbl>
##  1 slr1634    0.137 
##  2 amiC       0.0995
##  3 sll1863    0.0794
##  4 slr0376    0.0750
##  5 glnH/glnP  0.0667
##  6 sll7064    0.0589
##  7 slr5037    0.0550
##  8 sll7065    0.0543
##  9 sll1862    0.0540
## 10 petJ       0.0534

Another useful tool is to plot the expression levels in a heatmap.

5 Differential Expression

The goal of this exercise is to find differentially expressed genes.

First, recall the experimental design. What are our experimental groups?

sample_id sequence_id phosphate_mM purpose
191226_PC_2_1 S01 0.0115 treatment
191226_PC_2_3 S02 0.0115 treatment
191226_PC_2_4 S03 0.0115 treatment
200116_PC_3_1 S04 0.2300 control
200116_PC_3_3 S05 0.2300 control
200116_PC_3_4 S06 0.2300 control

We are analyzing our data using the DESeq methodology. Instead of reinventing the wheel we will use the DESeq2 package. You can find more information about the theoretical background of the package in the papers provided above.

5.1 Preparing the data for DESeq2

DESeq2 needs as inputs, at least, 3 objects:

  • countData: a matrix containing genes in rows and samples in columns. Gene names should be set as the row names of the matrix.
  • colData: a data.frame with two columns, one containing the sample names and the other one containing the experimental conditions (in our case, control and treatment).
  • design: a formula indicating the name of the column in the colData object that contains the experimental conditions.

5.1.1 Exercise:

Create a data.frame containing the (filtered) counts data and the design data. It should look like this:

## # A tibble: 19,962 x 12
##    gene  counts sequence_id counts_log2 sample_id multicultivator channel strain
##    <chr>  <dbl> <chr>             <dbl> <chr>               <dbl>   <dbl> <chr> 
##  1 AT103     20 S01                4.39 191226_P…               0       1 WT    
##  2 ESI3      92 S01                6.54 191226_P…               0       1 WT    
##  3 ETR1     142 S01                7.16 191226_P…               0       1 WT    
##  4 Era      305 S01                8.26 191226_P…               0       1 WT    
##  5 HAGH1    163 S01                7.36 191226_P…               0       1 WT    
##  6 IAP75   3169 S01               11.6  191226_P…               0       1 WT    
##  7 ICFG     800 S01                9.65 191226_P…               0       1 WT    
##  8 Sac1     282 S01                8.14 191226_P…               0       1 WT    
##  9 aat       37 S01                5.25 191226_P…               0       1 WT    
## 10 abfB     136 S01                7.10 191226_P…               0       1 WT    
## # … with 19,952 more rows, and 4 more variables: sample_time <dttm>,
## #   phosphate_mM <dbl>, purpose <chr>, dataset <chr>

5.1.2 Exercise:

Create the object for countData. Transform the df.diff to wide format, set the genes as row names and convert to matrix. It should look like this:

##        S04  S05  S06  S01  S02  S03
## AT103   26   28   18   20   34   30
## ESI3    84   98   77   92  200  123
## ETR1   120  141   98  142  242  209
## Era    380  244  260  305  651  375
## HAGH1  196  139  121  163  313  248
## IAP75 3181 3046 2703 3169 5090 4282

5.1.3 Excercise:

Create the object for colData. It should look like this:

## # A tibble: 6 x 2
##   sequence_id purpose  
##   <fct>       <fct>    
## 1 S04         control  
## 2 S05         control  
## 3 S06         control  
## 4 S01         treatment
## 5 S02         treatment
## 6 S03         treatment

5.1.4 Exercise:

Create the DESeq2 object and estimate size factors to apply counts normalization

5.2 Counts normalization

The previous function call creates a DESeq2 object which, among many other things, contains the normalized counts and the estimated size factors. These factors should be equal to the ones we calculated previously.

5.2.1 Exercise:

Check if the previous statement is true.

sequence_id size_factor
S01 0.8449753
S02 1.5748351
S03 1.1641719
S04 0.9563367
S05 0.8856490
S06 0.7758809

You can access the normalized counts from the DESeq object by calling the counts function with normalized = TRUE. This is how the normalized data looks like:

##              S04        S05        S06        S01        S02        S03
## AT103   27.18708   31.61523   23.19944   23.66933   21.58956   25.76939
## ESI3    87.83518  110.65331   99.24203  108.87892  126.99742  105.65450
## ETR1   125.47882  159.20528  126.30804  168.05225  153.66688  179.52675
## Era    397.34961  275.50417  335.10296  360.95730  413.37662  322.11738
## HAGH1  204.94874  156.94705  155.95176  192.90505  198.75097  213.02696
## IAP75 3326.23447 3439.28563 3483.78197 3750.40554 3232.08446 3678.15098

5.3 Dispersion estimation

In RNA-seq count data there is a dependency between the variance and the mean that is addressed in the statistical procedures that are used for differential gene expression analysis. This plot visualizes the (overdispersed) mean-variance dependency in your normalized data:

Computing mean and variance:

DESeq2 resolves this issue using regression and shrinkage:

5.4 Differential expression

In order to extract the results of the DESeq analysis, we need to call the results() function on the previously created DESeq object. The following code extracts into a data.frame the log fold change, the p-values and adjusted p-values for differential expression.

gene baseMean log2FoldChange lfcSE stat pvalue padj
AT103 25.50501 -0.1916194 0.2986252 -0.6418231 0.5209880 0.6338481
ESI3 106.54356 0.1992797 0.1785884 1.1156377 0.2645773 0.3760105
ETR1 152.03967 0.2734443 0.1596078 1.7129936 0.0867137 0.1525630
Era 350.73467 0.1199726 0.1430446 0.8386645 0.4016576 0.5194266
HAGH1 187.08842 0.2142938 0.1507856 1.4210019 0.1553162 0.2458311
IAP75 3484.99051 0.0554742 0.0897230 0.6182854 0.5363872 0.6484594

The results can be plotted in an MA plot:

…and you can plot the expression values of for instance the 10 most differentially expressed genes:

5.4.1 Exercise:

How many genes have an adj p-value < 0.01?

Answer: 1207.

5.4.2 Excercise:

Draw an histogram of the p-values and the adjusted p-values

5.4.3 Excercise:

Create a volcano plot.

5.4.4 Exercise:

Print a table with all the genes with adj p-value < 0.01 and log fold change > 2.

gene baseMean log2FoldChange lfcSE stat pvalue padj
slr1634 19777.15878 -5.285370 0.1254559 -42.100258 0 0
amiC 9403.41756 -3.904825 0.0883051 -44.201980 0 0
slr1403 8251.46024 3.965997 0.0842125 47.062210 0 0
pstB 8098.19056 4.102974 0.0883609 46.396667 0 0
pstA 8596.19340 4.714814 0.0867471 54.275118 0 0
sphX 2605.54493 5.067331 0.1227929 40.961615 0 0
pstC 15500.87089 5.175105 0.0851707 60.692507 0 0
pstS 40614.10586 5.957979 0.0978243 60.855515 0 0
nucH 20802.56889 6.461794 0.0949813 67.846384 0 0
sll0654 30245.13918 7.852965 0.1019336 76.532346 0 0
sll0723 3622.55188 3.206341 0.0868611 36.884368 0 0
sll0594 1442.15122 4.188438 0.1163392 35.779311 0 0
sll0721 4270.71116 3.960210 0.1195234 33.088664 0 0
ppk 2413.24297 2.789296 0.0891573 31.260566 0 0
sll0722 914.01220 3.615031 0.1224730 29.350124 0 0
glnH/glnP 1595.09671 -2.598133 0.1024505 -25.343683 0 0
sll1291 3101.15950 2.053192 0.0858406 23.912676 0 0
sll1552 1690.38691 8.125828 0.2621408 23.555387 0 0
sll0685 763.61650 2.882493 0.1294801 22.196748 0 0
slr0607 652.42191 3.039183 0.1394017 21.710851 0 0
slr1593 1266.53182 2.510339 0.1176945 21.305953 0 0
slr0376 12707.49112 -2.874160 0.1368229 -21.004237 0 0
sll0595 295.19046 3.065807 0.1547884 19.584811 0 0
sll1292 1127.69864 2.108668 0.1083922 19.438854 0 0
hisI 596.84033 2.225320 0.1145008 19.398010 0 0
sll1293 881.49736 2.092038 0.1129283 18.506943 0 0
slr0096 640.03443 2.000738 0.1129830 17.684886 0 0
sll1483 582.45118 2.181477 0.1297207 16.787484 0 0
slr6042 174.36264 2.353801 0.1797602 12.985387 0 0
sll7065 264.30013 -2.112276 0.1636629 -12.876342 0 0
sll7064 134.45549 -2.272804 0.1815262 -12.442298 0 0
petJ 255.62540 -2.064304 0.1668064 -12.348665 0 0
sll0720 176.92265 5.795054 0.3339371 11.349673 0 0
sll1863 1971.77673 -2.709240 0.3116859 -8.694742 0 0
slr0364 754.24096 2.176862 0.2574683 8.446758 0 0
sll5123 45.90136 2.323667 0.2846438 7.893656 0 0

5.5 Export

DEseq data was exported to ./data/processed/transcriptomics/20200305_transcriptomics_deseq.csv